Context models on sequences of covers 
Abstract

We present a class of models that, via a simple construction, enables 
exact, incremental, non-parametric, polynomial-time, Bayesian inference 
of conditional measures. The approach relies upon creating a sequence 
of covers on the conditioning variable and maintaining a different model 
for each set within a cover. Inference remains tractable by specifying the 
probabilistic model in terms of a random walk within the sequence of covers.
We demonstrate the approach on problems of conditional density estimation,
obtaining what is, to our knowledge, the first closed-form, non-parametric
Bayesian approach to this problem.



1 Introduction 

Conditional measure estimation is a fundamental problem in statistics. Specific 
instances of this problem include classification, regression and conditional den- 
sity estimation. This paper formulates a general approach for non-parametric, 
incremental, closed-form Bayesian estimation of conditional measures that re- 
lies on a model structure defined on a sequence of covers. This is an important 
development, particularly for the problem of conditional density estimation,
where, although the currently dominant non-parametric kernel-based approaches
generally perform well, a fast, tractable, incremental, Bayesian approach
has been lacking.

The construction used in this paper employs a random walk in a set of contexts.
In its simplest form, this can be seen as a descendant of context tree
methods for variable order Markov models (Willems et al. 1995; Dimitrakakis
2010) and Bayesian non-parametric methods for tree-based density estimation
(Hutter 2005; Wong and Ma 2010). These approaches utilise a
stopping variable construction on a tree to simplify inference. The central
contribution of this paper is to generalise this to a terminating random walk on
a lattice. The inference procedure then remains tractable, while the lattice
structure increases the flexibility and applicability of the model. As an example,
the proposed framework is applied to the important problem of conditional
density estimation, obtaining the first closed-form, incremental, non-parametric
Bayesian approach to this problem.



Stated generally, the problem of incremental, conditional measure estimation
in a Bayesian setting is as follows. We observe the sequences $x^t = (x_i : i = 1, \ldots, t)$
and $y^t = (y_i : i = 1, \ldots, t)$, with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Informally, our goal is the
prediction of the next observation $y_{t+1}$ given the next conditioning variable
$x_{t+1}$ and all previous evidence $x^t, y^t$. More precisely, we wish to calculate the
probability measure:

    $\psi_t(Y \mid x_{t+1}) \triangleq P(y_{t+1} \in Y \mid x^{t+1}, y^t)$    (1)

for all $Y \in \mathfrak{B}_{\mathcal{Y}}$, where $\mathfrak{B}_{\mathcal{Y}}$ denotes the Borel sets of $\mathcal{Y}$, through Bayes' theorem.
The main idea we use to tackle this problem is to first define a sequence of
covers on the space of all sequences $x^t$. Each cover is a collection of sets, such
that for any sequence $x^t$ there exists at least one set $M$ in every cover containing
that sequence. In addition, each set $M$ corresponds to a model $\phi_M$ on $\mathcal{Y}$. In
order to combine these, we introduce a random variable $S_t \in \mathscr{C}$, where $\mathscr{C}$ is the
collection of all sets in all covers, such that
$\zeta_t(M \mid x_{t+1}) = P(S_t = M \mid x^{t+1}, y^t)$ is the probability of the model $\phi_M$. Then
the conditional measure:

    $\psi_t(Y \mid x_{t+1}) = \sum_{M \in \mathscr{C}} \phi_M(Y \mid x_{t+1}) \, \zeta_t(M \mid x_{t+1})$,    (2)

can be readily obtained via marginalisation over the set of contexts.

We show that via the sequence of covers, $\zeta_t$ can be specified in terms of a
random walk. This allows closed-form, incremental inference to be performed in
polynomial time for conditional density estimation and variable order Markov
models, by selecting the covers appropriately. The resulting class of models
allows the introduction of several other interesting model classes.

2 Context models 

We first introduce some notation and basic assumptions. Unless otherwise
stated, we assume that all sets $\mathcal{X}$ are measurable with respect to some $\sigma$-algebra
$\mathfrak{B}_{\mathcal{X}}$. We denote sequences of observations $x_i \in \mathcal{X}$ by $x^t = (x_i : i = 1, \ldots, t)$.
The set $\mathcal{X}^0 = \{\emptyset\}$ contains only the null sequence $\emptyset$, while $\mathcal{X}^n = \times^n \mathcal{X}$
denotes the sequences of length $n$ and the set of all sequences is denoted by
$\mathcal{X}^* = \bigcup_{n=0}^{\infty} \mathcal{X}^n$. Finally, we denote the length of any sequence $x \in \mathcal{X}^*$ by
$\ell(x)$, such that $x \in \mathcal{X}^{\ell(x)}$.

A cover $C$ of some set $A$ is a collection of sets such that $\bigcup_{M \in C} M \supseteq A$. A
refinement $C'$ of $C$ is a cover of $A$ such that for any $M' \in C'$, there is some $M \in C$
such that $M' \subseteq M$. We consider models constructed on a sequence of covers
$\mathfrak{C} = (C_k : k = 1, \ldots)$ of $\mathcal{X}^*$. Letting $\mathscr{C} = \bigcup_k C_k$ be the collection of all subsets
in our sequence of covers, we refer to each subset $M \in \mathscr{C}$ as a context. Partition
trees, where each cover is disjoint and a refinement of the previous cover, are
an interesting special case:

Example 1 (Binary alphabet). Let $\mathcal{X} = \{0, 1\}$. For $k = 1, 2, \ldots$, let $C_k$ be the
partition of $\mathcal{X}^*$ into $2^k$ subsets, with the following property. For all $M \in C_k$,
and any $a, b \in \mathcal{X}^*$: $a, b \in M$ if and only if $a_{\ell(a)-i} = b_{\ell(b)-i}$ for all $i = 0, \ldots, k-1$.
This creates a sequence of partitions based on a suffix tree and can be used in
the development of variable order Markov models.

Example 2 (Unit interval). Let $\mathcal{X} = [0, 1]$. For $k = 1, 2, \ldots$, let $C_k$ be the
partition of $\mathcal{X}^*$ into $2^{k-1}$ subsets, $M_{k,i} \triangleq \{x \in \mathcal{X}^* : x_{\ell(x)} \in [2^{1-k}(i-1), 2^{1-k} i)\}$. A
generalised form of this sequence of covers is used in the construction of
conditional density estimation using the proposed construction, and shall be the main
focus of the current paper.
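
As a minimal sketch of how the contexts in the two examples can be looked up, the following Python functions map a sequence to its context at depth k. The function names are hypothetical, the dyadic interval width 2^(1-k) is an assumption about Example 2, and the handling of sequences shorter than k and of the endpoint x = 1 is our own choice rather than something the text specifies.

```python
def suffix_context(seq, k):
    """Example 1: a context at depth k is identified by the last k symbols.

    Returns None for sequences shorter than k (an assumption; the text
    does not say how short sequences are assigned).
    """
    if len(seq) < k:
        return None
    return tuple(seq[len(seq) - k:])


def interval_context(seq, k):
    """Example 2: 1-based index i of the dyadic interval of width 2**(1-k)
    that contains the last element of the sequence."""
    x = seq[-1]
    n = 2 ** (k - 1)          # number of subsets at depth k
    i = int(x * n) + 1        # [ (i-1)/n, i/n ) contains x
    return min(i, n)          # assumption: x = 1.0 joins the last interval


print(suffix_context([0, 1, 1, 0], 2))
print(interval_context([0.1, 0.6], 3))
```

Two sequences share a context at depth k exactly when these functions return the same value, so the returned key can be used to index per-context models.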

We now describe a conditional measure on $\mathcal{Y}$, indexed by $\mathcal{X}^*$, defined on
such a structure. This will form the basis for conditional measure estimation.
Intuitively, the structure defines a set of probability measures on $\mathcal{Y}$, indexed by
the set of all contexts. The structure is such that, for any $x \in \mathcal{X}^*$, there is only
one corresponding context $f(x)$, even if there are many contexts containing $x$.
The contexts themselves have the property that the corresponding context for
any $x$ in the set they define is either the same context or one of the subsequent
contexts in the sequence of covers. This will be useful later, since it will allow
us to perform closed-form inference on a distribution of context models.

Definition 1. A context model $\mu = (\mathcal{P}, f)$ defined on a (countable) sequence of
covers $\mathfrak{C} = (C_k : k = 1, \ldots)$ of $\mathcal{X}^*$ is composed of:

1. A set $\mathcal{P}$ of "local" probability measures on $\mathcal{Y}$, conditional on $\mathcal{X}^*$ and
indexed by elements in the set of contexts $\mathscr{C} = \bigcup_k C_k$:

    $\mathcal{P} \triangleq \{P_M(\cdot \mid x) : M \in \mathscr{C}\}, \quad x \in \mathcal{X}^*$.    (3)

2. A context map $f : \mathcal{X}^* \to \mathscr{C}$ such that $\forall x \in \mathcal{X}^*$, if $f(x) \in C_k$, then for any
$x' \in f(x)$ it holds that $f(x') \cap f(x) \neq \emptyset$ and $f(x') \in C_{k+h}$ with $h \geq 0$.

The model $\mu$ specifies the following conditional measure on $\mathcal{Y}$ for any $x \in \mathcal{X}^*$:

    $P_\mu(Y \mid x) \triangleq P_{f(x)}(Y \mid x), \quad Y \subseteq \mathcal{Y}$.    (4)

Though the local measures $\mathcal{P}$ can be simple, so that inference can be efficient,
the model's overall complexity will depend on the context map and cover
structure.

We now describe a distribution of such models, whereby exact Bayesian
inference can be performed in polynomial time. Intuitively, the distribution can
be seen as a two-stage process. Firstly, we sample a context map $f$ from a
set of context maps $\mathcal{F}$, through a halting random walk on the set of all
contexts. Secondly, for each context $M$ we sample a conditional measure $P_M$ from
a distribution $\phi_M$. The construction of, and sampling from, this distribution
are discussed in Sec. 2.1, while Sec. 2.2 shows how to sample from the marginal
distribution $\xi$ and Sec. 2.3 derives the inference procedure.



2.1 Construction of the context model distribution 

Definition 2 (Cover model). A cover model defines a distribution $\xi$ on
context models $\mu = (\mathcal{P}, f)$, through a tuple $(\mathfrak{C}, \mathcal{W}, \mathcal{V}, \Phi)$, where:

1. $\mathfrak{C} = (C_k : k = 1, \ldots)$ is a sequence of covers, and $\mathscr{C} = \{M \in C : C \in \mathfrak{C}\}$ is
the set of all contexts in each cover.

2. $\mathcal{W} = \{w_M : M \in \mathscr{C}\}$, with $w_M \in [0, 1]$, is a set of stopping probabilities.

3. $\mathcal{V} = \{v_M : M \in \mathscr{C}\}$ is a set of transition probability vectors, such that
$\|v_M\|_1 = 1$, and that if $M \in C_k$, then $v_{M,N} \in [0, 1]$ for all $N \in C_{k-1}$ such
that $N \cap M \neq \emptyset$, while $v_{M,N} = 0$ otherwise.

4. $\Phi = \{\phi_M : M \in \mathscr{C}\}$ is a set of priors such that each $\phi_M$ is a probability
measure on $\mathcal{P}_{\mathcal{Y}|\mathcal{X}}$, where $\mathcal{P}_{\mathcal{Y}|\mathcal{X}} = \{p_\theta(\cdot \mid x) : \theta \in \Theta\}$ is a set of probability
measures on $\mathcal{Y}$, conditional on $x \in \mathcal{X}$ and parameterised in $\Theta$.

In order to sample a context model $\mu = (\mathcal{P}, f)$ from $\xi = (\mathfrak{C}, \mathcal{W}, \mathcal{V}, \Phi)$, we
draw $\mathcal{P}$ directly from $\Phi$, while we construct $f$ via two auxiliary variables $\hat{w}_M$, $\hat{v}_M$
drawn respectively from a Bernoulli and a multinomial distribution:

    $P_M \sim \phi_M$    (5a)

    $\hat{w}_M \sim \mathrm{Bern}(w_M)$    (5b)

    $\hat{v}_M \sim \mathrm{Mult}(v_M)$.    (5c)

These draws are performed independently for all $M \in \mathscr{C}$. The construction of
$f$ relies on the cover structure. For any $x^t \in \mathcal{X}^*$, we denote the collection of
contexts at depth $k$ containing $x^t$ by

    $C_k^t \triangleq \{M \in C_k : x^t \in M\}$.    (6)

We then define the context map $f$ as follows: $f(x^t) = M \in C_k^t$, if and only if
$\hat{w}_M = 1$ and $\hat{w}_N = 0$ for all $N \in C_h^t$ with $h < k$.

2.2 Drawing samples from the marginal distribution 

In order to generate an observation in $\mathcal{Y}$ from the marginal distribution derived
from $\xi$, we can perform the following random walk.

Definition 3 (Marginal samples). A random walk on the sequence
of covers $\mathfrak{C} = (C_k : k = 1, \ldots, d)$, with parameters $\mathcal{W}, \mathcal{V}$, generates a random
sequence $S_1, \ldots, S_K$, with $K \in \{1, \ldots, d\}$, such that at each stage $k$:

1. $S_k \in C_{d+1-k}$ for all $k$.

2. With probability $w_{S_k}$, the walk stops and we generate a local model $P$ from
$\phi_{S_k}$ and, subsequently, an observation $y \in \mathcal{Y}$ from $P$.

3. Otherwise, $S_{k+1} = N$ with probability $v_{S_k,N}$ for all $N \in C_{d-k}$.
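
The walk in Definition 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: contexts are represented by hypothetical string names, the starting context is drawn uniformly from the finest cover (the definition does not say how the walk starts), and the walk is forced to stop once it reaches the coarsest cover. Drawing the local model and the observation at the halting context is left to the caller.

```python
import random


def sample_halting_walk(covers, w, v, rng):
    """Return the list of contexts visited by one halting random walk.

    covers: list of lists of context names; covers[0] is the coarsest
            cover C_1 and covers[-1] the finest cover C_d.
    w:      dict mapping a context to its stopping probability.
    v:      dict mapping a context to a dict {coarser context: probability}.
    rng:    a random.Random instance.
    """
    d = len(covers)
    current = rng.choice(covers[d - 1])   # assumption: uniform start in C_d
    path = [current]
    for _ in range(d - 1):
        if rng.random() < w[current]:     # stop here with probability w
            break
        successors = list(v[current].keys())
        weights = [v[current][n] for n in successors]
        current = rng.choices(successors, weights=weights, k=1)[0]
        path.append(current)
    return path                           # the last entry is where the walk halted


covers = [["root"], ["left", "right"]]
w = {"root": 1.0, "left": 0.3, "right": 0.3}
v = {"left": {"root": 1.0}, "right": {"root": 1.0}}
print(sample_halting_walk(covers, w, v, random.Random(0)))
```

The last element of the returned path plays the role of the halting context; an observation would then be drawn from a local model sampled from that context's prior.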



2.3 Inference 

At time $t$, we have observed $x^t = (x_i : i = 1, \ldots, t)$ and $y^t = (y_i : i = 1, \ldots, t)$;
our model now has parameters $\mathcal{W}_t, \mathcal{V}_t$, describing a distribution over context
models. We wish to update these parameters in the light of new evidence
$x_{t+1}, y_{t+1}$. The main idea is to use a random walk that halts at some context $M_t$,
in order to marginalise over context models. By definition, for any observation
sequence, there is at least one context containing $x^t$ in every cover $C_k$. We
denote the collection of those contexts by $C_k^t$, as in (6).

We start each stage $k$ of the walk at a context $S_k = N \in C_k^t$ and proceed to
$k-1, k-2, \ldots, 1$. Let $B_k^t \triangleq \{M_t \in \bigcup_{j=1}^{k} C_j^t\}$ denote the event that the walk
stops in one of the first $k$ stages. Then, with probability $w_N^t$, we generate the
next observation from the context $N \in C_k^t$, so that $y_{t+1} \mid x^{t+1} \sim \phi_N^t(\cdot \mid x_{t+1})$.
Otherwise, we proceed to the next stage, $k-1$, by moving to context $K \in C_{k-1}^t$
with probability $v_{N,K}^t$. More precisely:

    $v_{N,K}^t \triangleq P(S_{k-1} = K \mid S_k = N, x^t)$,    (7)

    $w_N^t \triangleq P(M_t \in C_k \mid S_k = N, B_k^t, x^t)$.    (8)

The central quantity for tractable inference in this model is the marginal
prediction given the event $B_k^t$, for which we can obtain the following recursion:

    $\psi_N^t(y_{t+1} \mid x_{t+1}) \triangleq P(y_{t+1} \mid S_k = N, B_k^t, x^{t+1})$
        $= w_N^t \, \phi_N^t(y_{t+1} \mid x_{t+1}) + (1 - w_N^t) \, P(y_{t+1} \mid S_k = N, B_{k-1}^t, x^{t+1})$,    (9)

noting that if we do not stop at level $k$ then $B_{k-1}^t$ is trivially true, or more
precisely, if $B_k^t$ and $M_t \notin C_k^t$ then $B_{k-1}^t$. Furthermore, it is easy to see that:

    $P(y_{t+1} \mid S_k = N, B_{k-1}^t, x^{t+1}) = \sum_{K \in C_{k-1}^t} \psi_K^t(y_{t+1} \mid x_{t+1}) \, v_{N,K}^t$.    (10)
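
The recursion for the marginal prediction can be sketched as a memoised function over the lattice. This is a hedged illustration: the dictionaries `w`, `v` and `local_lik` are hypothetical containers that are assumed to hold only the contexts containing the current input, `local_lik[N]` stands for the local predictive value of the new observation at context N, and a context without successors is assumed to force a stop.

```python
def marginal_prediction(node, w, v, local_lik, cache=None):
    """Marginal predictive value at `node`:
    stop with probability w[node] and use the local model, otherwise
    average the marginal predictions of the successor contexts."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    successors = v.get(node, {})
    if not successors:
        psi = local_lik[node]                 # assumption: forced stop
    else:
        cont = sum(p * marginal_prediction(k, w, v, local_lik, cache)
                   for k, p in successors.items())
        psi = w[node] * local_lik[node] + (1.0 - w[node]) * cont
    cache[node] = psi
    return psi


w = {"leaf": 0.5, "m1": 0.5, "m2": 0.5, "root": 1.0}
v = {"leaf": {"m1": 0.5, "m2": 0.5}, "m1": {"root": 1.0}, "m2": {"root": 1.0}}
local_lik = {"leaf": 0.2, "m1": 0.4, "m2": 0.6, "root": 0.8}
print(marginal_prediction("leaf", w, v, local_lik))
```

The cache matters on a lattice, where several contexts may share a successor; with it, each reachable context is evaluated once per observation.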

We can now calculate the stopping probabilities w and the transition probabil- 
ities V given the new evidence as follows: 

Theorem 1. Given a set of stopping parameters $\mathcal{W}_t = \{w_M^t : M \in \mathscr{C}\}$, a set of
transition parameters $\mathcal{V}_t = \{v_{K,N}^t : K \in C_k, N \in C_{k-1}, k = 1, \ldots\}$ and a set of
local measures on $\mathcal{Y}$: $\{\phi_K^t : K \in \mathscr{C}\}$, then the parameters at the next time step
are given by:

    $v_{N,K}^{t+1} = \frac{\psi_K^t(y_{t+1} \mid x_{t+1}) \, v_{N,K}^t}{\sum_{K'} \psi_{K'}^t(y_{t+1} \mid x_{t+1}) \, v_{N,K'}^t}$    (11)

and

    $w_M^{t+1} = \frac{\phi_M^t(y_{t+1} \mid x_{t+1}) \, w_M^t}{\phi_M^t(y_{t+1} \mid x_{t+1}) \, w_M^t + P(y_{t+1} \mid x^{t+1}, S_k = M, B_{k-1}^t) \, (1 - w_M^t)}$,    (12)

where $\psi$ is given by (9), while $\phi_M^t$ is a marginal measure conditioned on the first
$t$ observations for which $M$ is reachable by the random walk.
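
A single-context version of the update can be sketched as follows. It assumes the reading of the theorem in which both posteriors are obtained by Bayes' rule, with the transition weights normalised over the successors of the context; the function name and argument layout are hypothetical, and updating the local measures themselves (for example a conjugate posterior) is not shown.

```python
def bayes_update(w_n, v_n, phi_n, psi_succ):
    """Posterior stopping and transition probabilities at one context.

    w_n:      current stopping probability of the context.
    v_n:      dict {successor: transition probability}.
    phi_n:    local predictive value of the new observation at the context.
    psi_succ: dict {successor: marginal predictive value at that successor}.
    Returns (new_w, new_v, psi_n), where psi_n is the marginal prediction.
    """
    # Prediction given that the walk does not stop here.
    cont = sum(p * psi_succ[k] for k, p in v_n.items())
    # Marginal prediction at this context.
    psi_n = w_n * phi_n + (1.0 - w_n) * cont
    # Posterior probability of stopping here.
    new_w = w_n * phi_n / psi_n
    # Posterior transition probabilities, normalised over successors.
    new_v = {k: p * psi_succ[k] / cont for k, p in v_n.items()}
    return new_w, new_v, psi_n


new_w, new_v, psi = bayes_update(0.5, {"m1": 0.5, "m2": 0.5}, 0.2,
                                 {"m1": 0.6, "m2": 0.7})
print(new_w, new_v, psi)
```

In the usage example the local model explains the observation worse than the coarser contexts do, so the stopping probability decreases and the transition weight shifts towards the successor with the larger marginal prediction.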



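Proof. The proof mainly follows straightforwardly from the previous development.
From Bayes' theorem and (7), we obtain the recursion:

    $v_{M,N}^{t+1} = \frac{P(y_{t+1} \mid S_{k+1} = N, S_k = M, y^t, x^{t+1}) \, v_{M,N}^t}{\sum_{N'} P(y_{t+1} \mid S_{k+1} = N', S_k = M, y^t, x^{t+1}) \, v_{M,N'}^t}$.

Since the random walk $S_k$ is first order,¹ the conditioning on $S_k = M$ can be
dropped from each term, while finally from the definition of the marginal
prediction $\psi$ we obtain the required result. The recursion for $w^{t+1}$ is
proven analogously to Theorem 1 in (Dimitrakakis 2010). □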



2.4 Complexity 

As previously mentioned, the overall complexity of the model depends on how
the sequence of covers is constructed. The denser the covers are, the higher
the computational complexity. In the worst case, all contexts are reachable
by a random walk, making the complexity linear in the total number of
contexts. More generally, we can relate the complexity to the growth rate $C$ of the
number of sets containing each sequence $x \in \mathcal{X}^*$ as the number of covers $d$
increases.

Lemma 1. Let the sequence of covers be of length $d$. For any $x \in \mathcal{X}^*$, let
$C_k(x) = \{M \in C_k : x \in M\}$ be the set of contexts containing $x$ in the cover $C_k$
and let $|C_k(x)|$ be the number of contexts in $C_k(x)$. If there exists $C > 0$ such
that, for any $x \in \mathcal{X}^*$,

    $|C_{k+1}(x)| \leq C \, |C_k(x)|$,

then the number of reachable contexts is bounded by $O\left(\frac{C^{d} - 1}{C - 1}\right)$.

Proof. The proof follows trivially from the geometric series. □
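
The geometric-series argument can be spelled out as follows, under two assumptions that the lemma leaves implicit: the coarsest cover contains at most $c$ contexts per sequence, i.e. $|C_1(x)| \leq c$, and $C \neq 1$.

```latex
|C_k(x)| \;\le\; C^{\,k-1}\,|C_1(x)| \;\le\; c\,C^{\,k-1},
\qquad
\sum_{k=1}^{d} |C_k(x)| \;\le\; c \sum_{k=0}^{d-1} C^{\,k}
\;=\; c\,\frac{C^{\,d}-1}{C-1}.
```

For $C = 1$ the same sum is simply $c\,d$, i.e. linear in the number of covers, which is the partition-tree case where each sequence lies in a bounded number of contexts per cover.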

3 Applications 

The class contains both variable order Markov models and mixtures of $k$-order
Markov models on discrete alphabets, as well as density estimators and
conditional density estimators. All that is required in order to apply the method to
various cases is to select the context structure and the priors on the random
walk (that is, the stopping and transition probabilities) appropriately.



¹ We note that a higher order random walk on $S_k$ is possible, but we do not consider it in
this paper.



3.1 Variable order Markov models 



In the variable order Markov class, the sequence of covers is defined such that
the random walk starts from the finest refinement and proceeds to the coarsest
one. More specifically, consider a sequence of covers such that each cover is
a partition. Let $C_k$ be a partition of $\mathcal{X}^*$ and let $f_k : \mathcal{X}^k \to C_k$ such that
for each $x^k \in \mathcal{X}^k$, there exists $f_k(x^k) \in C_k$. Let $a \preceq b$ denote the fact that
$a$ is a suffix of $b$ and let $F(x) = \{x' \in \mathcal{X}^* : x \preceq x'\}$ be the set of sequences
for which $x$ is a suffix. Then $C_k = \{F(x) : x \in \mathcal{X}^k\}$. This could be an $n$-ary
partition tree, or more specifically, a suffix tree, if $|\mathcal{X}| = n$. In that case, there
would be only stopping probability parameters $w$ and no transition parameters
$v$, since in a suffix tree, each node has at most one child that contains $x^t$ for any
time $t$. The local models can be defined via Dirichlet priors (DeGroot 1970,
Sec. 9.8) on $\mathcal{Y}$. In the binary case, this corresponds to Example 1. In particular,
the defined variable order Markov model is identical to the formulation given
in (Dimitrakakis 2010) and a generalisation of (Willems et al. 1995).
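
For the binary suffix-tree case, the construction reduces to a chain of contexts per history, which the following sketch illustrates. It is not the formulation of the cited papers: the class name is hypothetical, the local models use a symmetric Beta(1/2, 1/2) prior (one possible Dirichlet choice), the initial stopping probability of 0.5 is an arbitrary assumption, and the empty suffix is assumed to force a stop.

```python
class BinaryVOM:
    """Sketch of a binary variable order Markov predictor on suffix contexts."""

    def __init__(self, max_depth, prior_w=0.5):
        self.max_depth = max_depth
        self.prior_w = prior_w      # assumed initial stopping probability
        self.counts = {}            # suffix -> [count of 0, count of 1]
        self.w = {}                 # suffix -> stopping probability

    def _contexts(self, history):
        # Suffixes of length 0..max_depth, shortest (coarsest) first.
        top = min(self.max_depth, len(history))
        return [tuple(history[len(history) - k:]) for k in range(top + 1)]

    def _phi(self, ctx, y):
        # Local predictive probability under a Beta(1/2, 1/2) prior.
        n = self.counts.get(ctx, [0, 0])
        return (n[y] + 0.5) / (n[0] + n[1] + 1.0)

    def predict(self, history, y):
        psi = None
        for k, ctx in enumerate(self._contexts(history)):
            phi = self._phi(ctx, y)
            if k == 0:
                psi = phi           # the empty suffix always stops
            else:
                w = self.w.get(ctx, self.prior_w)
                psi = w * phi + (1.0 - w) * psi
        return psi

    def update(self, history, y):
        psi = None
        for k, ctx in enumerate(self._contexts(history)):
            phi = self._phi(ctx, y)
            if k == 0:
                psi = phi
            else:
                w = self.w.get(ctx, self.prior_w)
                new_psi = w * phi + (1.0 - w) * psi
                self.w[ctx] = w * phi / new_psi   # posterior stopping probability
                psi = new_psi
            self.counts.setdefault(ctx, [0, 0])[y] += 1
        return psi


model = BinaryVOM(max_depth=3)
history = []
for t in range(200):
    symbol = t % 2                  # alternating sequence 0, 1, 0, 1, ...
    model.update(history, symbol)
    history.append(symbol)
print(model.predict(history, 0))
```

Because each history has exactly one context per depth, no transition parameters are needed and a prediction or update costs time linear in the maximum depth.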



3.2 Conditional density estimation 

In conditional density estimation, a simple way to generate the sequence of
covers is to use a kd-tree to create a sequence of partitions of $\mathcal{X}$. However, other
methods, such as a cover tree, are easily applicable. As in the variable order Markov
model case, the random walk starts from the finest cover (which corresponds to
the deepest part of the tree) and is subsequently coarsened. One particularly
interesting use of the flexibility offered by transition probabilities here is to
define multiple density estimators at each context.

For the density estimators in each context, we specifically consider two
alternatives. Firstly, a Normal-Wishart conjugate prior (DeGroot 1970, Sec. 9.10).
This is a classical Bayesian estimator, which can be updated in closed form.
Secondly, a Bayesian tree density estimator that straightforwardly extends
Hutter (2005) from densities on the $[0,1]$ interval to densities on $[0,1]^n$ through a
kd-tree. These alternatives are selected via the random walk. Consequently,
inference is performed on a double pseudo-tree.



4 Related work 



Among other things, the presented model relies upon a marginalisation over a
finite number of contexts for tractable inference. Similar mechanisms have of
course appeared before. It is nevertheless worthwhile to note two recent models
proposed in (Wong and Ma 2010; Hutter 2005), which are directly applied to
density estimation on $\mathcal{X}$. There, the selection of a context $M$ can be seen as
a walk starting from the root node of a tree, which corresponds to the whole
of $\mathcal{X}$, and proceeding to a matching child node, which is one of the subsets of
the root node, stopping with some probability. These models are not trivially
applicable to conditional density estimation, apart from the (perhaps naive)
approach of estimating $p(x, y)$ and $p(x)$ separately and using their ratio. On the
other hand, they can naturally be incorporated within our framework by using
them as optional sub-models performing density estimation in each context.

In the context of variable order Markov model estimation, a related
construction was presented in (Dimitrakakis 2010). There, the process can be seen
as a walk starting from the leaf node of a suffix tree, stopping with some
probability, otherwise proceeding to the parent node. The same structure is implicitly
present in the classic context tree weighting method (Willems et al. 1995). The
proposed framework can be seen as an extension of those two methods where
the context structure is not limited to a partition tree.

Most of the work on conditional density estimation has focused on
kernel-based methods and tree methods. For example, an approximate kernel
conditional density estimation method (Holmes et al. 2008) has recently been developed, which
employs a double tree structure for efficient estimation of the kernel bandwidth.
Finally, a set of tree models for conditional density estimation are surveyed in
(Scott Davies 2002). However, none of these methods is fully Bayesian, in the
sense that a distribution on models is not maintained. Rather, a single tree 
model is selected after all the data has been seen. In that sense, the approach 
suggested in this paper has the additional advantage of being incrementally 
updatable in closed form. 

Finally, it is worth mentioning the related problem of estimating conditional
probabilities in large (but finite) sets. For this problem, Beygelzimer et al.
(2009) propose and analyse an efficient, incremental tree-based method.



5 Numerical experiments 

We examined the algorithm on a number of conditional density estimation
domains. As previously mentioned in Sec. 3.2, we used a double pseudo-tree
structure, with optional Normal-Wishart conjugate priors for modelling
densities. The prior weights were set to $2^{-k}$ for contexts at depth $k$ in order to favour
short trees, while all transition probabilities were initially uniform. Since
inference is closed form, we can update all parameters according to Theorem 1. In
order to generate the covers efficiently, we construct a set of kd-tree structures
online. That is, once more than $\theta_k$ observations are within a leaf node at depth
$k$, the node is partitioned along its largest dimension. It is easy to see that the
(pseudo) tree depth, and consequently the complexity of the method, depends
on the choice of $\theta_k$.
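
The online splitting rule can be sketched as below. Several details are assumptions made only for illustration: the class and function names are hypothetical, the threshold is taken to be a ** depth with the root at depth 0, a node is split at the midpoint of its widest side, and the children start with empty counts.

```python
import random


class KDNode:
    """Axis-aligned box with an observation counter and an optional split."""

    def __init__(self, lower, upper, depth):
        self.lower = list(lower)
        self.upper = list(upper)
        self.depth = depth
        self.count = 0
        self.split = None           # (dimension, value, left child, right child)


def insert(root, x, a=2.0):
    """Route x to its leaf, split the leaf if it holds more than a**depth
    observations, and return the nodes (contexts) that contain x."""
    node = root
    path = [root]
    while node.split is not None:
        dim, value, left, right = node.split
        node = left if x[dim] < value else right
        path.append(node)
    node.count += 1
    if node.count > a ** node.depth:
        widths = [u - l for l, u in zip(node.lower, node.upper)]
        dim = widths.index(max(widths))                 # largest dimension
        value = 0.5 * (node.lower[dim] + node.upper[dim])
        left_upper = list(node.upper)
        left_upper[dim] = value
        right_lower = list(node.lower)
        right_lower[dim] = value
        node.split = (dim, value,
                      KDNode(node.lower, left_upper, node.depth + 1),
                      KDNode(right_lower, node.upper, node.depth + 1))
    return path


rng = random.Random(1)
root = KDNode([0.0, 0.0], [1.0, 1.0], 0)
depths = [len(insert(root, [rng.random(), rng.random()])) - 1 for _ in range(100)]
print(max(depths))
```

The returned path lists one context per depth for the given input, coarsest first, which is the set of contexts that a halting walk would traverse in reverse.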

Lemma 2. For a total of $T$ observations and $\theta_k = a^k$, $a > 1$, the tree depth is
bounded by $O(\log_a T(a - 1))$ and $\Omega(\log_{a\beta} T(a\beta - 1))$, where $\beta$ is a branching
factor.

Proof. Let us first consider the upper bound. The depth is maximal when the
deepest leaf node is reached for every observation. Consequently,

    $T = \sum_{k=0}^{d} a^k = \frac{a^{d+1} - 1}{a - 1}$,

and so $d = \log_a[1 + (a - 1)T] - 1$. We can obtain a lower bound by examining
the case where the tree is balanced. Then the number of nodes at depth $k$ is
$N_k = \beta^k$ and consequently:

    $T = \sum_{k=0}^{d} N_k \theta_k = \sum_{k=0}^{d} (a\beta)^k$,

and so $d = \log_{a\beta}[1 + (a\beta - 1)T] - 1$. □



Using this lemma, it is easy to see that the total complexity is O (TlogT), 
thus only slightly worse than linear. 
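
The two depth expressions from the proof of Lemma 2 are easy to evaluate numerically. The helper names below are hypothetical; the formulas are the closed forms obtained by inverting the two geometric sums.

```python
import math


def depth_upper(T, a):
    """Depth when every observation reaches the deepest leaf:
    T = sum_{k=0}^{d} a**k."""
    return math.log(1 + (a - 1) * T, a) - 1


def depth_lower(T, a, beta):
    """Depth of a balanced tree with branching factor beta:
    T = sum_{k=0}^{d} (a * beta)**k."""
    ab = a * beta
    return math.log(1 + (ab - 1) * T, ab) - 1


for T in (10 ** 2, 10 ** 4, 10 ** 6):
    print(T, depth_lower(T, 2.0, 2.0), depth_upper(T, 2.0))
```

Both quantities grow logarithmically in T, so processing T observations, each of which touches one context per level, costs on the order of T log T.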

5.1 An illustration 





Figure 1: Conditional density estimation illustration on a Gaussian ring
distribution, shown in four panels (a) to (d) for different numbers of
observations. It can be seen that the estimator settles on a Gaussian density near the
edges, where the distribution is approximately normal, while it uses a pseudo-tree
distribution near the ring. The structure is refined with subsequent
observations.



Name               X      Y      training   holdout
Gaussian mixture   R      R      10^?       10^?
Uniform mixture    R      R      10^?       10^?
Geyser             R      R      200        72
Robot              R^?    R^?    2812       2644

Table 1: Summary of datasets

Figure 1 demonstrates the context model estimator on a ring Gaussian
distribution from which samples were generated as follows. Firstly, the mean of
a Gaussian was drawn by sampling an angle $\theta$ from a mixture of univariate
Gaussians. The observation was then drawn from a bivariate Gaussian with
mean equal to the location on a unit ring determined by the drawn angle.
Consequently, near and within the ring, the distribution is highly non-Gaussian,
while further away from the ring the distribution approaches normality. This
is borne out in the figure: in far-away regions the distribution is
modelled with a smooth Gaussian, whereas close to the ring, even for a limited number of
samples, the parts of the model which correspond to non-Gaussian distributions
have a higher probability.
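
The sampling procedure just described can be sketched as follows. The text does not give the mixture parameters or the covariance of the bivariate Gaussian, so the component means, the angular standard deviation, the mixture weights and the isotropic noise level below are all illustrative assumptions.

```python
import math
import random


def sample_ring(rng, angle_means=(0.0, math.pi), angle_std=0.5,
                weights=(0.5, 0.5), noise_std=0.1):
    """Draw one point from a ring Gaussian distribution."""
    # Pick a mixture component, then draw the angle from its Gaussian.
    mean_angle = rng.choices(angle_means, weights=weights, k=1)[0]
    theta = rng.gauss(mean_angle, angle_std)
    # The mean of the bivariate Gaussian lies on the unit ring.
    cx, cy = math.cos(theta), math.sin(theta)
    # Assumption: isotropic covariance noise_std**2 * I.
    return (rng.gauss(cx, noise_std), rng.gauss(cy, noise_std))


rng = random.Random(0)
points = [sample_ring(rng) for _ in range(5)]
print(points)
```

Taking the first coordinate as the conditioning variable and the second as the target gives a conditional density that is bimodal inside the ring and unimodal far from it.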



5.2 Comparisons 

We compared our method with a double-kernel conditional density estimator
utilising cross-validation for bandwidth selection. This is effectively a Parzen
window estimator combined with a kernel density estimator. Although such
methods are generally robust, they suffer from two drawbacks. The first is the
computational complexity, especially in terms of the bandwidth selection for the
two kernels. This is addressed by Holmes et al. (2008), who use a
double tree structure to accelerate the search. The second and most important
drawback is that the estimated bandwidth is invariant across the space. This may potentially
create problems, since ideally one would like to vary the kernel in different parts
of the space. For our quantitative experiments, we utilised a Gaussian kernel
throughout for the kernel estimators.

The experiments were performed on a number of datasets, summarised in
Table 1. The first two are large, synthetic datasets. The Gaussian mixture
dataset is a mixture of three Gaussian distributions on $\mathbb{R}^2$, where the first
dimension is used as the conditioning variable. Similarly, the Uniform mixture
dataset is a mixture of three uniform distributions. We also have results from
two real datasets. The first, Geyser, is the well-known dataset of eruption times
and durations for the "Old Faithful" geyser. The second dataset, Robot, is a set
of proximity sensor readings from a robot performing a navigation task.

For each dataset, we measured the average negative log-likelihood of each
method as the amount of training data increased. Each dataset $D$ was split
into a training set $D_T$ and hold-out set $D_H$. For each method, we obtained a
sequence of conditional density models $p_t$, trained on the subset $D_t \subseteq D_T$ of the
first $t$ observations in the training set, and then calculated the average negative
log-likelihood of that model on the hold-out set $D_H$:

    $L_t = -\frac{1}{|D_H|} \sum_{(x, y) \in D_H} \log p_t(y \mid x)$.

Figure 2: Conditional density estimation performance on a hold-out set, for four
different datasets, (a) Gaussian mixture, (b) Uniform mixture, (c) Geyser and
(d) Robot, as the number of observations $t$ increases. The performance
is in terms of the relative log loss $L_t$ or average negative log-likelihood of the
hold-out set. In most cases, the context cover double pseudo-tree significantly
outperforms a bandwidth-tuned kernel estimator.

For the cover method, we employed the same settings as in the previous
experiment. For the double-kernel method, for each training subset $D_t$, we
employed 10-fold cross-validation to select the bandwidths of the two kernels
and then used the chosen bandwidths to obtain a model on the full subset $D_t$.
The criterion for choosing the bandwidth was the likelihood on the left-out folds.
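
The baseline and its evaluation can be sketched for scalar inputs and outputs as follows. This is a generic double-kernel estimator with Gaussian kernels and a grid search over bandwidths, not the exact experimental code: the function names, the bandwidth grid, the interleaved fold assignment and the small floor inside the logarithm are all assumptions made for the illustration.

```python
import math
import random


def gauss(u, h):
    """Gaussian kernel with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def double_kernel_density(x, y, data, hx, hy):
    """Kernel-weighted average over the data of a kernel density in y."""
    den = sum(gauss(x - xi, hx) for xi, _ in data)
    if den == 0.0:
        return 0.0
    num = sum(gauss(x - xi, hx) * gauss(y - yi, hy) for xi, yi in data)
    return num / den


def avg_nll(data, holdout, hx, hy):
    """Average negative log-likelihood on a hold-out set."""
    total = sum(math.log(max(double_kernel_density(x, y, data, hx, hy), 1e-300))
                for x, y in holdout)
    return -total / len(holdout)


def select_bandwidths(data, grid, folds=10):
    """Pick (hx, hy) maximising the log-likelihood of the left-out folds."""
    best = None
    for hx in grid:
        for hy in grid:
            score = 0.0
            for f in range(folds):
                train = [d for i, d in enumerate(data) if i % folds != f]
                test = [d for i, d in enumerate(data) if i % folds == f]
                score -= len(test) * avg_nll(train, test, hx, hy) if test else 0.0
            if best is None or score > best[0]:
                best = (score, hx, hy)
    return best[1], best[2]


rng = random.Random(0)
data = [(x, x + rng.gauss(0.0, 0.1)) for x in (rng.random() for _ in range(40))]
hx, hy = select_bandwidths(data, [0.05, 0.2, 1.0])
print(hx, hy, avg_nll(data[:30], data[30:], hx, hy))
```

A single pair of bandwidths is chosen for the whole space here, which is the invariance that the comparison in this section is concerned with.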

Figure 2 compares the performance of our model with a double-kernel
conditional density estimator. One would expect the kernel method to perform best in the
Gaussian mixture dataset, while the cover method would be favoured in the
uniform mixture. This, however, is clearly not the case. Firstly, note that the
cover method can optionally use a Normal-Wishart distribution to model the
density at any part of the space. Thus, the pure Gaussian kernel has no initial
advantage. Secondly, some parts of $\mathcal{X}$ have far fewer samples and so would
require a much wider kernel for accurate estimation. However, the use of an
invariant kernel means that this is not possible. In the uniform mixture dataset,
the kernel method performs almost as well as the cover method, though it is initially
at a disadvantage due to the bad fit of the Gaussian kernel to the uniform blocks. In
the widely used, although extremely small, Geyser dataset, it can be seen that
the kernel method dominates the cover one. However, the difference is quite
small and the size of the dataset is such that the performance of each method
is mainly dependent upon how well its prior assumptions match the dataset.
Finally, in the Robot dataset, which is high-dimensional but has only a moderate
number of observations, the methods are more or less evenly matched. The
initially bad performance of the kernel method is mainly due to the fact that it
is hard to choose a good bandwidth from only 100 samples in a high-dimensional
space.

Overall, one may observe that the two methods usually perform similarly.
However, the cover method appears to be more robust and in some
cases its asymptotic performance is significantly better than that of the kernel
method.



6 Conclusion 

We outlined an efficient, online, closed-form inference procedure for estimation
on a sequence of covers. It can be seen as a direct extension of a previous
construction (Dimitrakakis 2010), which was limited to partition trees, and of an
analogous procedure for density estimation on partition trees, given by Hutter
(2005).

In principle, the approach is applicable to any problem involving estimation 
of conditional measures, such as classification and variable order Markov model 
estimation. As an example, we applied it to conditional density estimation, a 
fundamental problem in statistics. The result is the first, to our knowledge,
closed-form, incremental, polynomial-time, Bayesian conditional density esti- 
mation method. 

In order to do this, we utilised a double pseudo-tree structure. The first 
part of the structure was used to estimate the conditional probabilities of context 
models. The second part of the structure was used to estimate a density for each 
context. This resulted in a procedure for closed-form, Bayesian, non-parametric
conditional density estimation. As expected, the performance of this method 
was in some cases significantly better than that of a kernel based estimator with 
an invariant kernel. 

In future work, we would like to consider other density estimators for the 
local context models. Since there are virtually no restrictions regarding their 
type (other than the ability for incremental conditioning), using kernel density
estimators on each context instead could be a route towards obtaining
non-invariant kernel density estimation methods. In addition, it would be
interesting to consider problems where we have some prior information regarding the
smoothness of the underlying conditioning density, perhaps in terms of Lipschitz 
conditions with respect to the conditioning variable. 

The main open problem is how to generate the covers. In this paper,
we utilised a kd-tree to do so. However, the generality of the approach is
such that many other more interesting alternatives are possible. For example,
cover trees (Beygelzimer et al. 2006), which are an extremely efficient
nearest-neighbour method, are an ideal alternative. This alternative structure would
allow the application of cover models to an arbitrary metric space. In addition,
inference on any lattice structure should remain tractable.

Nevertheless, the problem of finding a suitable sequence of covers remains. 
This is more pronounced for controlled processes, because one cannot rely on 
the statistics of the observations to create a useful cover. This problem can be 
circumvented if a distribution on covers is maintained, which would be more
in the spirit of the optional Polya tree (Wong and Ma 2010). However,
inference would then no longer be closed form.